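# Analyze topics over different time periods
library(topicmodels)
library(tm)
library(ldatuning)  # optional: helps choose the number of topics
# Preprocess the corpus (`documents` is assumed to be a character vector, one element per document)
corpus <- Corpus(VectorSource(documents))
dtm <- DocumentTermMatrix(corpus)
# Fit LDA model (`num_topics` is the number of topics you choose)
lda_model <- LDA(dtm, k = num_topics)

Intro to NLP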
Intro to text analytics
What is Text mining / analytics?
Text Mining or Text Analytics is the process of deriving meaningful information from natural language text.
— Some text analytics definitions
- Corpus (pl. corpora) ~ a collection of similar documents: objects that typically contain raw strings annotated with additional metadata and details
- Document ~ collection of sentences
- String ~ in computational approaches, a string is a specific type of data that represents text and is often encoded in a specific format, e.g., Latin-1 or UTF-8.
- Token ~ is a meaningful unit of text, such as a word, that we are interested in using for analysis
- Tokenization ~ splitting text into tokens
- Bigram/n-gram ~ a contiguous sequence of n tokens (a bigram has n = 2); tokens can also be larger units, such as a sentence or paragraph
- Collocations ~ words that are attracted to each other (and that co-occur or co-locate together), e.g., Merry Christmas, Good Morning, No worries.
- Tidy text ~ “a table with one-token-per-row”
- Stemming ~ reducing a word to its stem, the base part of a word to which affixes can be attached to form derivatives
- Lemmatization ~ similar to stemming, but incorporating the meaning, so that a word is reduced to its dictionary form (gone | going -> go)
- Document-term matrix (DTM) ~ rows = documents | cols = words | cells = [0,1]/frequencies. A sparse matrix describing a collection (i.e., a corpus) of documents with one row for each document and one column for each term. (WDR_com)
- Term-document matrix (TDM) ~ rows = words | cols = documents | cells = [0,1]/frequencies. The transpose of a DTM: a sparse matrix describing a collection (i.e., a corpus) of documents with one row for each term and one column for each document. (WDR_com)
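To make the definitions above concrete, here is a minimal base-R sketch of tokenization, bigrams, and a document-term matrix. The two documents and the helper names `tokenize()` and `bigrams()` are made up for illustration; in practice a package such as tm would handle these steps.

```r
# Toy corpus: two short documents (made up for illustration)
docs <- c(doc1 = "Good morning team, good news today",
          doc2 = "No worries, the team is doing good work")

# Tokenization: lower-case, then split on anything that is not a letter
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[tokens != ""]
}
tokens <- lapply(docs, tokenize)

# Bigrams: each token pasted to the token that follows it
bigrams <- function(tok) paste(head(tok, -1), tail(tok, -1))

# Document-term matrix: one row per document, one column per term
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Term-document matrix: the transpose of the DTM
tdm <- t(dtm)

tokens$doc1
bigrams(tokens$doc1)
dtm[, c("good", "team")]
```

Real corpora produce mostly zero cells, which is why packages store these matrices in a sparse format rather than as a dense matrix like the one here.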
What is NLP?
A part of computer science and AI that deals with human languages, Natural Language Processing (NLP) has evolved significantly over the past few decades, driven by advances in computational power, machine learning, and the availability of large datasets.
Broadly speaking, these were some key steps in its evolution:
- Early Years (1950s - 1980s): Rule-based Systems. Early NLP systems were based on rule-based methods, which relied on handcrafted rules for tasks like translation, parsing, and information retrieval.
- 1970s - 1980s: Statistical Methods and Linguistic Models. The introduction of the Chomskyan linguistic models influenced NLP research, focusing on syntax and grammar, while statistical methods began to emerge, laying the groundwork for more data-driven approaches.
- 1990s: Statistical NLP. The field shifted significantly towards statistical approaches, due to the availability of larger text corpora and more powerful computers. Hidden Markov Models (HMMs) and n-grams became popular for tasks such as part-of-speech tagging, speech recognition, and machine translation.
- 2000s: Machine Learning and Data-Driven Methods. Machine learning rose in NLP, particularly supervised learning methods: Support Vector Machines (SVMs), Maximum Entropy models, etc. The development of large annotated corpora and platforms fueled progress in areas such as parsing, word sense disambiguation, and sentiment analysis.
- 2010s: Deep Learning Revolution. Neural networks, particularly recurrent neural networks (RNNs) and later long short-term memory (LSTM) networks, became the standard for many NLP tasks. The introduction of word embeddings allowed words to be represented as continuous vectors in a high-dimensional space, capturing semantic relationships between them. Convolutional Neural Networks (CNNs) were applied to text classification and other tasks, although they were more commonly used in computer vision. The development of sequence-to-sequence (Seq2Seq) models enabled advancements in machine translation, summarization, and other sequence generation tasks. Transformers outperformed RNNs on many tasks and led to the development of large-scale pre-trained language models.
- Late 2010s - Present: Pre-trained Language Models and NLP at Scale. Pre-trained language models like BERT (2018) by Google and GPT (Generative Pre-trained Transformer) by OpenAI revolutionized NLP by providing powerful, general-purpose models that could be fine-tuned for specific tasks with minimal training data. The concept of transfer learning became central, where models trained on massive datasets could be adapted to specific tasks. Models such as ChatGPT, BERT, T5, and FLAN-T5 continue to push the boundaries of what NLP can achieve, leading to increasingly sophisticated and human-like interactions.
- 2020s - Future Directions. Multimodal models: integrating NLP with other forms of data, such as images and audio, to create more comprehensive models. Explainability and interpretability: as models grow in complexity, understanding their decision-making processes becomes more important.
— Some commonly used NLP scenarios
- Natural Language Processing (NLP) ~ an interdisciplinary field in computer science that specializes in processing natural language data using computational and mathematical methods.
- Network Analysis ~ the most common way to visualize relationships between entities. Networks, also called graphs, consist of nodes (typically represented as dots) and edges (typically represented as lines) and they can be directed or undirected networks.
- Text Classification ~ a supervised learning method of learning and predicting the category or the class of a document given its text content.
- Named Entity Recognition ~ NER is the task of classifying words or key phrases of a text into predefined entities of interest.
- Text Summarization ~ a language generation task of summarizing the input text into a shorter paragraph of text.
- Entailment ~ the task of classifying the binary relation between two natural-language texts, text and hypothesis, to determine if the text agrees with the hypothesis or not.
- Question Answering ~ QA is the task of retrieving or generating a valid answer for a given query in natural language, provided with a passage related to the query.
- Sentence Similarity ~ the process of computing a similarity score given a pair of text documents.
- Embeddings ~ the process of mapping a word or a piece of text to a vector of real numbers in a continuous, usually low-dimensional, space.
- Sentiment Analysis ~ the task of identifying the polarity (e.g., positive, negative, neutral) expressed in a text. Aspect-Based Sentiment Analysis does this for the specific aspects mentioned in a text, and can for example be trained and used with Azure ML and Intel NLP Architect.
- Semantic Analysis ~ analyzes the meaning (semantics) of texts. Such analyses often rely on semantic tagsets that are based on word meaning or meaning families/categories.
- Part-of-Speech (PoS) Tagging ~ identifies the word classes of words (e.g., noun, adjective, verb, etc.) in a text and adds part-of-speech tags to each word.
- Topic Modeling ~ a machine learning method that seeks to answer the question: given a collection of documents, can we identify what they are about? Topic model algorithms look for patterns of co-occurrences of words in documents.
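As an illustration of the sentence similarity scenario listed above, the sketch below scores pairs of sentences with cosine similarity on simple bag-of-words counts. The sentences and the helper names `cosine_similarity()` and `bow_vector()` are made up for illustration; real systems would usually compare embeddings rather than raw counts.

```r
# Cosine similarity between two numeric vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Bag-of-words count vector over a shared vocabulary
bow_vector <- function(text, vocab) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  as.numeric(table(factor(tokens, levels = vocab)))
}

s1 <- "the company reports strong growth"
s2 <- "the company reports weak growth"
s3 <- "employees enjoy flexible working hours"

# Shared vocabulary built from all three sentences
vocab <- sort(unique(unlist(strsplit(tolower(c(s1, s2, s3)), "[^a-z]+"))))

cosine_similarity(bow_vector(s1, vocab), bow_vector(s2, vocab))  # 4 of 5 words shared: 0.8
cosine_similarity(bow_vector(s1, vocab), bow_vector(s3, vocab))  # no shared words: 0
```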
Modeling strategy
Which model could help me detect the change over time of a certain concept in a corpus of company documents?
To detect the change over time of a certain concept in a corpus of company documents, you can use a combination of Topic Modeling, Word Embeddings, and Temporal Analysis techniques. Here are some models and methods that can help with this task:
1. Dynamic Topic Models (DTM)
- Dynamic Topic Models extend traditional topic modeling approaches like Latent Dirichlet Allocation (LDA) to account for changes in topics over time.
- DTM models the evolution of topics in a time series, allowing you to see how the prominence of certain topics or concepts shifts over different time periods.
- This is useful if you want to track how the discussion around a concept evolves within your document corpus.
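A simpler approximation of this idea, which is not a full Dynamic Topic Model, is to fit one static LDA model and then average the document-topic proportions within each time period. The sketch below assumes a hypothetical matrix `gamma` of document-topic proportions (with topicmodels, it could come from `posterior(lda_model)$topics`) and a hypothetical `period` label per document.

```r
# Hypothetical document-topic proportions (rows = documents, columns = topics).
# With topicmodels, this matrix could come from posterior(lda_model)$topics.
gamma <- rbind(
  c(0.8, 0.2),
  c(0.7, 0.3),
  c(0.4, 0.6),
  c(0.2, 0.8)
)
colnames(gamma) <- c("topic_1", "topic_2")

# Hypothetical time stamp for each document
period <- c("2010s", "2010s", "2020s", "2020s")

# Average topic share per period: one row per period, one column per topic
prevalence <- rowsum(gamma, group = period) / as.vector(table(period))
prevalence
```

Unlike a Dynamic Topic Model, this keeps the word distribution of each topic fixed and only tracks how much each topic is discussed per period.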
2. Temporal Word Embeddings
- Temporal Word Embeddings are extensions of static word embeddings like Word2Vec or GloVe that capture how the meaning of words changes over time.
- By training word embeddings separately for different time periods (e.g., by splitting your corpus into time segments), you can analyze how the vector representation of a concept shifts over time.
- Models like Temporal Word2Vec or Dynamic Word Embeddings can help identify changes in how a concept is discussed or understood over time.
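One caveat when comparing separately trained embeddings: each model ends up in its own coordinate system, so vectors from two periods are not directly comparable. A common remedy is to rotate one space onto the other with orthogonal Procrustes before measuring cosine similarity. The base-R sketch below uses made-up 2-d vectors; `procrustes_align()` and `row_cosine()` are hypothetical helper names.

```r
# Orthogonal Procrustes: find the rotation Q that minimises ||X1 %*% Q - X2||_F
# and return X1 rotated into the coordinate system of X2
procrustes_align <- function(X1, X2) {
  s <- svd(crossprod(X1, X2))  # SVD of t(X1) %*% X2
  Q <- s$u %*% t(s$v)
  X1 %*% Q
}

# Row-wise cosine similarity between two matrices of equal shape
row_cosine <- function(A, B) {
  rowSums(A * B) / (sqrt(rowSums(A^2)) * sqrt(rowSums(B^2)))
}

# Toy 2-d embeddings for a shared vocabulary (values made up for illustration)
emb_t1 <- rbind(
  sustainability = c(1.0, 0.1),
  diversity      = c(0.2, 1.0),
  profit         = c(0.9, 0.8),
  risk           = c(-0.7, 0.6)
)

# Period 2: the same space rotated by 90 degrees ...
rotation <- matrix(c(0, -1, 1, 0), nrow = 2)
emb_t2 <- emb_t1 %*% rotation
# ... except that "sustainability" also moves to a new position
emb_t2["sustainability", ] <- c(0.6, 0.8)

# Align period 1 to period 2, then compare each word with itself across periods
aligned_t1 <- procrustes_align(emb_t1, emb_t2)
sort(row_cosine(aligned_t1, emb_t2))
```

Words whose similarity stays low after alignment are candidates for a change in meaning or usage.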
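library(wordVectors)
# Train a word2vec model for each time period (each input file holds the text of one period)
model_t1 <- train_word2vec("text_time_period_1.txt", output_file = "vec_t1.bin")
model_t2 <- train_word2vec("text_time_period_2.txt", output_file = "vec_t2.bin")
# Compare word embeddings over time, e.g., via the nearest neighbours of a term in each period
closest_to(model_t1, "sustainability")
closest_to(model_t2, "sustainability")

3. BERT with Temporal Fine-tuning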
- BERT (Bidirectional Encoder Representations from Transformers) can be fine-tuned on your document corpus, with separate fine-tuning stages for different time periods.
- By comparing the contextual embeddings of the concept across different time periods, you can analyze how the context and usage of the concept have changed over time.
- Alternatively, you can use a time-aware variant like BERTime, which is designed to capture temporal dynamics in textual data.
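This step is normally done in Python; the R code below is illustrative pseudocode.
# NOTE: `bertR`, bert_load(), and bert_finetune() are placeholder names, not a verified R API
library(bertR)
# Load pre-trained BERT model and fine-tune on time-specific data
model <- bert_load("bert-base-uncased")
fine_tuned_model <- bert_finetune(model, train_data)
# Generate embeddings and analyze changes over time

4. Sentence Transformers with Time-Based Clustering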
- Use Sentence Transformers (e.g., SBERT) to generate embeddings for sentences or paragraphs discussing the concept.
- Apply clustering algorithms (with packages like cluster or factoextra) to these embeddings over different time periods to see how the clustering of topics around the concept changes.
- This approach is effective for tracking nuanced changes in how the concept is discussed at different points in time.
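The clustering step can be sketched in base R with `kmeans()`. The 2-d "embeddings" below are made up for illustration; real sentence embeddings (e.g., from SBERT) have hundreds of dimensions and would be computed outside this snippet.

```r
set.seed(42)  # k-means starts from random centres

# Hypothetical 2-d sentence embeddings; real ones would come from a sentence encoder
embeddings <- rbind(
  c(0.1, 0.2), c(0.2, 0.1), c(0.0, 0.1), c(5.0, 5.1),  # period t1
  c(0.1, 0.0), c(5.2, 4.9), c(4.9, 5.0), c(5.1, 5.2)   # period t2
)
period <- rep(c("t1", "t2"), each = 4)

# Cluster all sentences together so clusters mean the same thing in both periods
fit <- kmeans(embeddings, centers = 2, nstart = 10)

# Share of each period's sentences falling into each cluster (rows sum to 1)
cluster_share <- prop.table(table(period, cluster = fit$cluster), margin = 1)
cluster_share
```

Clustering the pooled embeddings once, rather than once per period, keeps the cluster labels comparable; a shift in the row shares then indicates that the concept is being discussed in a different context.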
5. Sequential Neural Networks for Temporal Sequence Prediction
- Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks can be used to model sequences of documents or sentences over time.
- By training these models to predict future text sequences, you can analyze how the patterns associated with a concept evolve, which can help you detect changes in its importance or context.
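Sequence models expect their input as a 3-d array of shape (samples, time_steps, features). A minimal sketch of that preparation step, assuming the concept has already been reduced to one numeric score per period; `make_windows()` and the `concept_share` values are hypothetical.

```r
# Turn a numeric series into overlapping windows for a sequence model.
# x has shape (samples, time_steps, features = 1); y is the value following each window.
make_windows <- function(series, time_steps) {
  n <- length(series) - time_steps
  stopifnot(n >= 1)
  x <- array(NA_real_, dim = c(n, time_steps, 1))
  y <- numeric(n)
  for (i in seq_len(n)) {
    x[i, , 1] <- series[i:(i + time_steps - 1)]
    y[i] <- series[i + time_steps]
  }
  list(x = x, y = y)
}

# Hypothetical yearly share of documents mentioning a concept
concept_share <- c(0.10, 0.12, 0.11, 0.15, 0.20, 0.26, 0.30, 0.31)
windows <- make_windows(concept_share, time_steps = 3)
dim(windows$x)  # 5 3 1
windows$y
```

Here `windows$x` and `windows$y` would play the role of `x_train` and `y_train`, with `time_steps = 3` and `features = 1`.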
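library(keras)
library(tensorflow)
# Define and train LSTM model
# Assumes x_train has shape (samples, time_steps, features) and y_train holds one numeric target per sample
# return_sequences = FALSE passes on only the last hidden state, so the model outputs one value per sequence
model <- keras_model_sequential() %>%
  layer_lstm(units = 50, return_sequences = FALSE, input_shape = c(time_steps, features)) %>%
  layer_dense(units = 1)
model %>% compile(loss = 'mse', optimizer = 'adam')
history <- model %>% fit(x_train, y_train, epochs = 20, batch_size = 32)

6. Change Point Detection Algorithms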
- Integrating change point detection algorithms (e.g., Bayesian change point detection) with NLP techniques allows you to pinpoint when significant shifts in the usage or context of a concept occur.
- This is particularly useful if you’re interested in identifying specific events or periods that caused a shift in how a concept is discussed.
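To show the core idea without any package, the sketch below finds a single change point by least squares: it tries every split of the series and keeps the one where two separate means fit best. This is a deliberately simple stand-in, not the Bayesian method mentioned above; `find_mean_shift()` and the `concept_score` values are made up for illustration.

```r
# Single change point by least squares: split where two segment means fit best
find_mean_shift <- function(x) {
  n <- length(x)
  stopifnot(n >= 2)
  sse <- sapply(seq_len(n - 1), function(k) {
    left <- x[1:k]
    right <- x[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  which.min(sse)  # index of the last observation before the shift
}

# Hypothetical concept score per period, with a jump after the fifth period
concept_score <- c(0.10, 0.12, 0.09, 0.11, 0.10, 0.31, 0.29, 0.33, 0.30, 0.32)
find_mean_shift(concept_score)  # 5
```

Dedicated packages add what this sketch lacks: a penalty to decide whether any change is present at all, and support for multiple change points.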
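library(changepoint)
library(bcp)  # Bayesian alternative; not used below
# Apply change point detection to time series of concept scores
# (`your_time_series_data` is assumed to be a numeric vector, e.g., one concept score per period)
cpt <- cpt.meanvar(your_time_series_data)  # default method finds at most one change; method = "PELT" allows several
cpts(cpt)  # estimated change point locations
plot(cpt)

Workflow Example: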
- Data Collection: The researchers collected a large corpus of corporate social responsibility (CSR) reports from Fortune 500 companies spanning multiple years (e.g., from the 1990s to the 2010s).
- Preprocessing: The text from these reports was cleaned, tokenized, and prepared for analysis. For your own corpus, segment the documents by time (e.g., by year or quarter).
- Modeling:
  - Apply a topic model, e.g., Latent Dirichlet Allocation (LDA) or a Dynamic Topic Model, to identify key topics discussed in the reports. By applying LDA to different time segments (e.g., reports from the 1990s, 2000s, 2010s), you can track the prominence of each topic over time.
  - Or train Temporal Word Embeddings for each time segment.
  - Alternatively, fine-tune BERT or Sentence Transformers for each time segment.
  - Example (Zhang, Kim, and Xing 2015): this study uses Dynamic Topic Models (DTM) to track changes in market competition by analyzing text data from financial news articles and company reports. The focus is on understanding how discussions about companies and markets evolve over time and how these changes correlate with stock market returns.
- Temporal Analysis:
  - Track the changes in topic distributions or embedding vectors associated with your concept.
  - Use clustering, cosine similarity, or other distance metrics to quantify the change over time.
  - Apply change point detection to identify periods of significant shift.
- Word Embeddings:
  - Word2Vec models were trained on CSR reports from different time periods to observe how the semantic meaning of key terms (e.g., “sustainability,” “diversity”) evolved.
  - By comparing the embeddings of these terms over time, the researchers assessed shifts in how these concepts were discussed.
- Change Detection:
  - The study used statistical methods to identify significant change points in the emphasis on particular CSR themes. This helped pinpoint specific events or external pressures (like new regulations) that may have driven changes in corporate communication.
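Before any modeling, a quick descriptive check is often useful: segment the corpus by time and track how often the concept is mentioned. The base-R sketch below uses a made-up `reports` data frame with hypothetical `year` and `text` columns.

```r
# Toy corpus with a time stamp per document (texts made up for illustration)
reports <- data.frame(
  year = c(1995, 1998, 2005, 2008, 2015, 2018),
  text = c(
    "annual profit and growth",
    "profit growth and sustainability",
    "sustainability and diversity goals",
    "profit and sustainability report",
    "sustainability diversity and climate",
    "climate sustainability and diversity"
  ),
  stringsAsFactors = FALSE
)

# Segment the corpus by decade
reports$decade <- paste0(floor(reports$year / 10) * 10, "s")

# Share of documents per decade that mention the concept
mentions <- grepl("sustainability", tolower(reports$text), fixed = TRUE)
concept_share <- tapply(mentions, reports$decade, mean)
concept_share
```

The resulting series, one value per time segment, is also a natural input for the change point detection step.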
By using these models, you can systematically detect and analyze how the concept of interest changes over time in your corpus of company documents.